Abstract

Distance computation (e.g., computing shortest paths) is one of the most fundamental primi- 
tives used in communication networks. The cost of effectively and accurately computing pairwise 
network distances can become prohibitive in large-scale networks such as the Internet and Peer- 
to-Peer (P2P) networks. To negotiate the rising need for very efficient distance computation at 
scales never imagined before, approximation techniques for numerous variants of this question 
have recently received significant attention in the literature. Several different areas of theoret- 
ical research have emerged centered around this problem, such as metric embeddings, distance 
labelings, spanners, and distance oracles. The goal is to preprocess the graph and store a small 
amount of information such that whenever a query for any pairwise distance is issued, the 
distance can be well approximated (i.e., with small stretch) very quickly in an online fashion. 
Specifically, the pre-processing (usually) involves storing a small sketch with each node, such 
that at query time only the sketches of the concerned nodes need to be looked up to compute 
the approximate distance. 

Techniques derived from metric embeddings have been considered extensively by the net- 
working community, usually under the name of network coordinate systems. On the other hand, 
while the computation of distance oracles has received considerable attention in the context of 
web graphs and social networks, there has been little work towards similar algorithms within the 
networking community. In this paper, we present the first theoretical study of distance sketches 
derived from distance oracles in a distributed network. We first present a fast distributed al- 
gorithm for computing approximate distance sketches, based on a distributed implementation 
of the distance oracle scheme of [Thorup-Zwick, JACM 2005]. We also show how to modify 
this basic construction to achieve different tradeoffs between the number of pairs for which the 
distance estimate is accurate, the size of the sketches, and the time and message complexity nec- 
essary to compute them. These tradeoffs can then be combined to give an efficient construction 
of small sketches with provable average-case as well as worst-case performance. Our algorithms 
use only small-sized messages and hence are suitable for bandwidth-constrained networks, and 
can be used in various networking applications such as topology discovery and construction, to- 
ken management, load balancing, monitoring overlays, and several other problems in distributed 
algorithms. 
1 Introduction



A fundamental operation on large networks is finding shortest paths between pairs of nodes, or 
at least finding the lengths of these shortest paths. This problem is not only a common building 
block in many algorithms, but is also a meaningful operation in its own right. In a distributed 
network such as a large peer to peer network, this may be useful in search, topology discovery, 
overlay creation, and basic node to node communication. However, given the large size of these 
networks, computing shortest path distances can, if done naively, require a significant amount of 
both time and network resources. As we would like to make distance queries in real time with 
minimal latency, it becomes important to use small amounts of resources per distance query. 

One approach to handle online distance requests is to perform a one-time offline or centralized 
computation. A straightforward brute force solution would be to compute the shortest paths 
between all pairs of nodes offline and to store the distances locally in the nodes. Once this has 
been accomplished, answering a shortest-path query online can be done with no communication 
overhead; however, the local space requirement is quadratic in the number of nodes in the graph (or 
linear if only shortest paths from the node are stored). For a large network containing millions of 
nodes, this is simply infeasible. An alternative, more practical approach is to store some auxiliary 
information with each node that can facilitate a quick distance computation online in real time. 
This auxiliary information is then used in the online computation that is performed for every request 
or query. One can view this auxiliary information as a sketch of the neighborhood structure of a 
node that is stored with each node. Simply retrieving the sketches of the two nodes should be 
sufficient to estimate the distance between them. Three properties are crucial for this purpose: 
first, these sketches should be reasonably small in size so that they can be stored with each node 
and accessed for any node at run time. Second, there needs to be a simple algorithm that, given 
the sketches of two nodes, can estimate the distance between them quickly. And third, even though 
the computation of the sketches is an offline computation, this cost also needs to be accounted for 
(as the distance information or network itself changes frequently, and this would require altering 
the sketches periodically). 

Sketches for the specific purpose of distance computation in communication networks have 
been referred to as distance labelings (by the more theoretical literature) and as network coordinate 
systems (by the more applied literature). There has been a significant amount of work from a 
more theoretical point of view on the fundamental tradeoff between the size of the sketches and 
the accuracy of the distance estimates they give (see e.g. Thorup and Zwick [TZ05], Gavoille et
al. [GPPR04], Katz et al. [KKKP04], and Cohen et al. [CFI+09]). However, all of these papers
assumed a centralized computation of sketches, so are of limited utility in real distributed systems. 
In the networking community there has also been much work on constructing good network coor- 
dinate systems, including seminal work such as the Vivaldi system [DCKM04] and the Meridian 
system [WSS05]. While this line of work has resulted in almost fully functioning systems with
efficient distributed algorithms, the theoretical underpinning of such systems is lacking; most of 
them can easily be shown to exhibit poor behavior in pathological instances. The main exception 
to this is Meridian, which has a significant theoretical component. However, it assumes that the 
underlying metric space is "low-dimensional" , and it is easy to construct high-dimensional instances 
on which Meridian does poorly. 

We attempt to move the theoretical line of research slightly closer to practice by designing 
efficient distributed algorithms for computing accurate distance sketches. We give algorithms with 
bounded round and message complexity in a standard model of distributed computation (the CONGEST
model [Pel00]) that compute sketches that give distance estimates provably close to the actual
distances. In particular, we engineer a distributed version of the seminal centralized algorithm
of Thorup and Zwick [TZ05]. Their algorithm computes sketches that allow us to approximate
distances within a factor 2k - 1 (this value is known as the stretch) by using sketches of size
O(kn^{1/k} log n) for any integer k ≥ 1. Up to the factor of k in the size, this is known to be a
tight tradeoff in the worst case assuming a famous conjecture by Erdős [TZ05]. Note that this
achieves its minimum size at k = log n, giving an O(log n)-factor approximation to the distances
using sketches of size O(log^2 n). We further extend these results to give a distributed algorithm
based on the centralized algorithms of Chan et al. [CDG06] that computes sketches with the same
worst-case stretch and almost the same size, but with provably better average stretch. To the best
of our knowledge, this is the first theoretical analysis of distributed algorithms for computing
distance sketches. Our work can also be viewed as an efficient computation of local node-centric
views of the global topology, which may be of independent interest for numerous different
applications (cf. Section 2.1).



1.1 Our Contributions 

Our main contributions are new distributed algorithms for various types of distance sketches. Most
of the actual sketches have been described in previous work [TZ05, CDG06], but we are the first
to show that they can be efficiently constructed in a distributed network. While we formally define
the model and problem in Section 2.2, at a high level we assume a synchronous distributed network
in which in a single communication "round" every node can send a message of up to O(log n)
bits (or 1 word) to each of its neighbors. Every edge has a nonnegative weight associated with
it, and the distance between two nodes is the total weight of the shortest path (with respect to
weights) between them. We begin in Section 3 by giving a distributed algorithm that constructs
Thorup-Zwick distance sketches [TZ05] efficiently:



Theorem 1.1. For any k ≥ 1, there is a distributed algorithm that takes O(kn^{1/k} S log n) rounds
and O(kn^{1/k} S|E| log n) messages, after which with high probability every node has a sketch of size
at most O(kn^{1/k} log n) words that provides approximate distances up to a factor of 2k - 1.



The value S in this theorem is known as the shortest-path diameter [KP08], and is (informally)
the maximum over all (n choose 2) shortest paths (where "short" is determined by weight) of the
number of hops on the path. (In an unweighted network, S is the same as the network diameter D,
hence S can be thought of as a generalization of D to a weighted network.) Note that this is
essentially a lower bound on any distance computation.

In Section 4 we show how to extend the techniques of Chan, Dinitz, and Gupta [CDG06] and
combine them with the techniques of Section 3 to give sketches with "slack". Informally, a sketch
has ε-slack if the stretch factor (i.e. the distance approximation guarantee) only holds for a
(1 - ε)-fraction of the pairs, rather than all pairs. While this is a weaker guarantee, since some
pairs have no bound on the accuracy of the distance estimate at all, both the size of the sketches
and the time needed to construct them become much smaller. For example, when ε is a constant (even
a small constant) we can construct constant stretch sketches in only O(S log^2 n) rounds.

Theorem 1.2. For any ε > 0 and 1 ≤ k ≤ O(log(1/ε)), there is a distributed sketching algorithm that
gives sketches with size at most O(k ((1/ε) log n)^{1/k} log n) words and stretch 8k - 1 with ε-slack
that completes in at most O(kS ((1/ε) log n)^{1/k} log n) rounds and O(kS|E| ((1/ε) log n)^{1/k} log n)
messages.

Finally, in Section 4.1 we extend the slack techniques further, allowing us to efficiently construct
sketches with the same worst-case stretch as in Theorem 1.1 (with k = O(log n)) but with average
stretch only O(1):

Theorem 1.3. There is a distributed sketching algorithm that gives sketches of size O(log^4 n) with
O(log n)-stretch and O(1) average stretch that completes in at most O(S log^4 n) rounds and at most
O(S|E| log^4 n) messages.


2 Related Work and Model 
2.1 Applications and Related Works 

Applications of approximate distance computations in distributed networks include token management
[IJ90, BBF04, CTW93], load balancing [KR04], small-world routing [Kle00], and search [ZS06,
ALPH01, Coo05, GMS05, LCC+02]. Several other areas of distributed computing also use distance
computations in a crucial way; some examples are information propagation and gathering
[BAS04, KKD01], network topology construction [GMS05, LS03, LKRG03], monitoring overlays
[MG07], group communication in ad-hoc networks [DSW06], gathering and dissemination of
information over a network [AKL+79], and peer-to-peer membership management [GKM03, ZSS05].

The most concrete application of our algorithms is quickly computing approximate shortest-path
distances in networks, i.e. the normal application of network coordinate systems. In particular,
in weighted networks, after using our algorithms to preprocess the network and create distance
sketches, we can compute the approximate distance between any two nodes in a number of rounds that
is at most O(D) times the size of the sketch (where D, the hop-diameter, is the maximum over all
pairs of nodes of the minimum number of hops between the nodes) by simply exchanging the sketches
of the two nodes. On the other hand, note that any distance computation without preprocessing (say,
Dijkstra's algorithm, Bellman-Ford, or even a simple network ping to obtain the round-trip time)
will take at least Ω(S) rounds, where S is the shortest-path diameter. This is less than ideal since S
can be as large as n, the number of nodes in the network, whereas D, the hop-diameter, can be, and
typically is, much smaller. Therefore our sketches yield improved algorithms for pairwise
weighted-distance computations. Moreover, in networks such as P2P networks and overlay networks,
a node using our algorithms can compute distances (number of hops in the overlay) in a number of
rounds that is a constant times the sketch size if it simply knows the IP address of the other node:
it can directly contact the other node using its IP address and ask for its sketch. Thus these sketch
techniques can be very relevant and applicable even for unweighted distance computations.

Probably the closest results to ours are from the theory behind Meridian [WSS05], which is based
on a modification of the "ring-of-neighbors" theoretical framework developed by Slivkins [Sli05,
Sli07] to prove theoretical bounds. However, there are some substantial differences. For one,
the bounds given by these papers (including [WSS05]) are limited to special types of metric
spaces: those with either bounded doubling dimension or bounded growth. Our bounds hold
for all (weighted) graphs (with, of course, weaker guarantees on the stretch). Furthermore, the
distributed framework used by Slivkins is significantly different from the standard CONGEST model
of distributed computation that we use, and is based on being able to work in the metric completion
of the graph. This means, for example, that the algorithms in [Sli07] have the ability to send a
unit-size packet between any two nodes in O(1) time (and the algorithms do make strong use of
this ability).
2.2 Model and Notation



We model a communication network as a weighted, undirected, connected n-node graph G = (V, E). 
Every node has limited initial knowledge. Specifically, assume that each node is associated with a 
distinct identity number (e.g., its IP address). At the beginning of the computation, each node v 
accepts as input its own identity number and the identity numbers of its neighbors in G. The node 
may also accept some additional inputs as specified by the problem at hand. The nodes are allowed 
to communicate through the edges of the graph G. We assume that the communication occurs 
in synchronous rounds. We will use only small-sized messages. In particular, in each round, each 
node v is allowed to send a message of size O(logn) through each edge e = (v,u) that is adjacent 
to v. The message will arrive at u at the end of the current round. We also assume that all edge 
weights are at most polynomial in n, and thus in a single round a distance or node ID can be sent 
through each edge. A word is a block of O(logn) bits that is sufficient to store either a node ID or 
a network distance. 

This is a widely used standard model to study distributed algorithms (called the CONGEST
model, e.g., see [Pel00, PK09]) and captures the bandwidth constraints inherent in real-world
computer networks. (For example, classical network algorithms studied include algorithms for shortest
paths (Bellman-Ford, Dijkstra), minimum spanning trees, etc.) Our algorithms can be easily
generalized if B bits are allowed (for any pre-specified parameter B) to be sent through each edge in
a round. Typically, as assumed here, B = O(log n), which is the number of bits needed to send a node
ID in an n-node network. We assume that n (or some constant factor estimate of n) is common
knowledge among nodes in the network.

Every edge in the network has some nonnegative weight associated with it, and the distance
between two nodes is the minimum, over all paths between the nodes, of the sum of the weights of
the edges on the path. In other words, it is the normal shortest-path distance with edge weights. We
let d(u,v) denote this distance for all u, v ∈ V. For a set A ⊆ V and a node u ∈ V, we define the
distance from the node to the set to be d(u,A) = min{d(u,a) : a ∈ A}. For a node u ∈ V and real
number r ≥ 0, the ball around u of radius r is defined to be B(u,r) = {v ∈ V : d(u,v) ≤ r}.

The hop-diameter D of G is defined to be the maximum over all pairs u, v in V of the number
of hops between u and v. In other words, it is the maximum over all pairs of the distance between u
and v, but where distance is computed assuming that all edge weights are 1, rather than their actual
weights. The shortest-path diameter S of G is slightly more complicated to define. For u, v ∈ V, let
P_{u,v} be the set of simple paths between u and v with total weight equal to d(u,v) (by the definition
of d(u,v) there is at least one such path). Let h(u,v) be the minimum, over all paths P ∈ P_{u,v},
of the number of hops in P (i.e. the number of edges). Then S = max_{u,v ∈ V} h(u,v). It is easy to see
that D ≤ S and in general, any method of computing the distance from u to v must use at least S
rounds (or else the shortest path will not be discovered).
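
The difference between D and S can be checked on a toy instance. The following is a minimal centralized sketch, not part of the distributed algorithms of this paper: it assumes the graph is given as a dictionary of adjacency dictionaries with integer weights, and the helper names `dist_and_hops` and `diameters` are hypothetical. It runs Dijkstra on pairs (weight, hops) compared lexicographically, so that among all minimum-weight paths the one with the fewest hops is found.

```python
import heapq

def dist_and_hops(adj, src):
    """For every node, return (d(src, v), h(src, v)): the shortest-path weight and
    the fewest hops among all paths of that weight (lexicographic Dijkstra)."""
    best = {v: (float("inf"), float("inf")) for v in adj}
    best[src] = (0, 0)
    heap = [(0, 0, src)]
    while heap:
        d, h, u = heapq.heappop(heap)
        if (d, h) > best[u]:
            continue  # stale heap entry
        for w, weight in adj[u].items():
            cand = (d + weight, h + 1)
            if cand < best[w]:
                best[w] = cand
                heapq.heappush(heap, (cand[0], cand[1], w))
    return best

def diameters(adj):
    """Return (D, S): hop-diameter and shortest-path diameter of a connected graph."""
    unit = {u: {w: 1 for w in adj[u]} for u in adj}  # same graph, all weights 1
    D = max(d for u in adj for d, _ in dist_and_hops(unit, u).values())
    S = max(h for u in adj for _, h in dist_and_hops(adj, u).values())
    return D, S

# Toy graph (an assumption for illustration): a path 0-1-2-3 with unit weights,
# plus a heavy direct edge between 0 and 3.
adj = {0: {1: 1, 3: 10}, 1: {0: 1, 2: 1}, 2: {1: 1, 3: 1}, 3: {2: 1, 0: 10}}
D, S = diameters(adj)
```

In this toy graph the heavy edge keeps every pair within two hops, but the shortest path between nodes 0 and 3 still uses three edges, so S exceeds D.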



3 Distributed Sketches 

We are concerned with the problem of constructing a distance labeling scheme in a distributed
manner. Given an input (weighted) graph G = (V,E), we want a distributed algorithm so that
at termination every node u ∈ V knows a small label (or sketch) L(u) with the property that
we can (quickly) compute an approximation to the distance between u and v just from L(u) and
L(v). Since the requirement of sketch sizes and latency in distance computation may vary from
application to application, typically one would like a trade-off between the distance approximation
and these parameters.



3.1 Thorup-Zwick Construction 



The best-known algorithm for constructing distance sketches in a centralized manner is that of
Thorup and Zwick [TZ05], which works as follows. They first create a hierarchy of node sets:
A_0 = V, and for 1 ≤ i ≤ k - 1, we get A_i by randomly sampling every vertex in A_{i-1} with
probability n^{-1/k}, i.e. every vertex in A_{i-1} is included in A_i with probability n^{-1/k}.
We set A_k = ∅ and d(u,A_k) = ∞ by definition.

Let p_i(u) be the vertex in A_i with minimum distance from u. Let B_i(u) = {w ∈ A_i : d(u,w) <
d(u,A_{i+1})}, and let B(u) = ∪_{i=0}^{k-1} B_i(u) (for now we will assume that all distances are
distinct; this assumption can be made without loss of generality by breaking ties consistently
through processor IDs or some other method). B(u) is called the bunch of u. The label L(u) of u
consists of all nodes {p_i(u)}_{i=0}^{k-1} and B(u), as well as the distances to all of these nodes.
Thorup and Zwick showed that these labels are enough to approximately compute the distance, and
that these labels are small. We give sketches of these proofs for completeness.
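
The construction above can be illustrated with a small centralized sketch. This is only an illustration under stated assumptions, not the distributed algorithm developed later: the graph is a dictionary of adjacency dictionaries with positive weights, ties are broken by node ID, the hierarchy is resampled if the top level A_{k-1} comes out empty (a choice made here for convenience), and the names `dijkstra`, `sample_hierarchy`, and `build_labels` are hypothetical. Each label is stored as a list of pivots (p_i(u), d(u, p_i(u))) and a list of per-level bunches B_i(u).

```python
import heapq
import random

def dijkstra(adj, src):
    """Single-source shortest-path distances in a weighted graph."""
    dist = {v: float("inf") for v in adj}
    dist[src] = 0
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for w, weight in adj[u].items():
            if d + weight < dist[w]:
                dist[w] = d + weight
                heapq.heappush(heap, (dist[w], w))
    return dist

def sample_hierarchy(nodes, k, rng):
    """Return [A_0, ..., A_k] with A_0 = V, A_k empty, and each A_i sampled
    from A_{i-1} with probability n^(-1/k). Resamples if A_{k-1} is empty."""
    p = len(nodes) ** (-1.0 / k)
    while True:
        levels = [set(nodes)]
        for _ in range(1, k):
            levels.append({v for v in sorted(levels[-1]) if rng.random() < p})
        if levels[-1]:
            return levels + [set()]

def build_labels(adj, k, rng):
    """Compute the pivots and per-level bunches of every node."""
    levels = sample_hierarchy(sorted(adj), k, rng)
    labels = {}
    for u in sorted(adj):
        dist = dijkstra(adj, u)
        pivots, bunches = [], []
        for i in range(k):
            pivot = min(levels[i], key=lambda w: (dist[w], w))  # p_i(u)
            pivots.append((pivot, dist[pivot]))
            d_next = min((dist[w] for w in levels[i + 1]), default=float("inf"))
            bunches.append({w: dist[w] for w in levels[i] if dist[w] < d_next})
        labels[u] = (pivots, bunches)
    return labels, levels

# Toy input (an assumption for illustration): a weighted cycle on 8 nodes.
adj = {i: {} for i in range(8)}
for i in range(8):
    j = (i + 1) % 8
    adj[i][j] = adj[j][i] = i + 1
labels, levels = build_labels(adj, 3, random.Random(0))
```

Running all-pairs Dijkstra is of course exactly the cost that the distributed construction tries to avoid; the sketch is only meant to make the definitions of A_i, p_i(u), and B_i(u) concrete.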

Lemma 3.1 ([TZ05]). For all u ∈ V, the expected size of L(u) is at most O(kn^{1/k}) words.

Proof. We prove that the expected size of B_i(u) is at most n^{1/k} for every 0 ≤ i ≤ k - 1, which
clearly implies the lemma via linearity of expectation. Suppose that we have already made the
random decisions that define levels A_0, ..., A_i, and now for each v ∈ A_i we flip the coin to see if it
is also in A_{i+1}. If we flip these coins in order of distance from u (this is just in the analysis; the
algorithm can flip the coins simultaneously or in arbitrary order) then the size of B_i(u) is just the
number of coins we flip before we see a heads, where the probability of flipping a heads is n^{-1/k}.
In expectation this is n^{1/k}. □
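
For readers who want the expectation spelled out, the following is a short worked version of the coin-flipping argument, writing p = n^{-1/k} for the sampling probability and X for the number of tails seen before the first heads (the symbols p and X are introduced here only for this calculation):

```latex
\begin{align*}
\Pr[X \ge j] &\le (1-p)^{j} \qquad \text{(the first $j$ coins are all tails)},\\
\mathbb{E}\bigl[|B_i(u)|\bigr] = \mathbb{E}[X]
  &= \sum_{j \ge 1} \Pr[X \ge j]
   \le \sum_{j \ge 1} (1-p)^{j}
   = \frac{1-p}{p}
   \le \frac{1}{p} = n^{1/k},\\
\mathbb{E}\bigl[|B(u)|\bigr]
  &\le \sum_{i=0}^{k-1} \mathbb{E}\bigl[|B_i(u)|\bigr]
   \le k\, n^{1/k}.
\end{align*}
```

Adding the k pivots p_0(u), ..., p_{k-1}(u) and one distance per stored node changes the bound only by a constant factor.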

Lemma 3.2 ([TZ05]). Given L(u) and L(v) for some u, v ∈ V, we can compute a distance estimate
d'(u,v) with d(u,v) ≤ d'(u,v) ≤ (2k - 1)d(u,v) in time O(k).

Proof. For each 0 ≤ i ≤ k - 1, we check whether p_i(u) ∈ B_i(v) or p_i(v) ∈ B_i(u). Let i* be the
first level at which at least one of these events occurs. Note that i* is well-defined and is at most
k - 1, since by definition p_{k-1}(u) ∈ B_{k-1}(v) and p_{k-1}(v) ∈ B_{k-1}(u). If the first condition is true
then we return distance estimate d'(u,v) = d(u,p_{i*}(u)) + d(v,p_{i*}(u)), and if the second condition
is true then we return d'(u,v) = d(u,p_{i*}(v)) + d(v,p_{i*}(v)). Note that the necessary distances are
in the labels as part of B_{i*}(u) and B_{i*}(v), so we can indeed compute this from L(u) and L(v).

We first prove by induction that d(u,p_i(u)) ≤ i · d(u,v) and d(v,p_i(v)) ≤ i · d(u,v) for all i ≤ i*.
In the base case, when i = 0, both inequalities are true by definition. For the inductive step, let
1 ≤ i ≤ i*. Since i ≤ i* we know that i - 1 < i*, so p_{i-1}(u) ∉ B_{i-1}(v) and p_{i-1}(v) ∉ B_{i-1}(u). This
implies that d(v,p_i(v)) ≤ d(v,p_{i-1}(u)) ≤ d(v,u) + d(u,p_{i-1}(u)) ≤ d(u,v) + (i - 1)d(u,v) = i · d(u,v),
where the first inequality is from i - 1 < i*, the second is from the triangle inequality, and the third
is from the inductive hypothesis. Similarly, we get that d(u,p_i(u)) ≤ i · d(u,v).

Now suppose without loss of generality that p_{i*}(v) ∈ B_{i*}(u) (if the roles are reversed we can just
switch the names of u and v). Then our distance estimate is d'(u,v) = d(u,p_{i*}(v)) + d(v,p_{i*}(v)) ≤
d(u,v) + 2d(v,p_{i*}(v)) ≤ d(u,v) + 2i* d(u,v) = (2i* + 1)d(u,v), where the first inequality is from the
triangle inequality and the second is from our previous inductive proof. Since i* ≤ k - 1, this gives
a stretch bound of 2k - 1 as claimed. □
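
The query procedure from this proof can be sketched in a few lines. This is a minimal illustration under assumptions: a label is represented as a pair (pivots, bunches), where pivots[i] = (p_i(u), d(u, p_i(u))) and bunches[i] maps each node of B_i(u) to its distance from u; this representation, the function name `estimate`, and the hand-built labels below are hypothetical and chosen only for the example.

```python
def estimate(label_u, label_v):
    """Return the distance estimate d'(u, v) computed from the two labels alone."""
    pivots_u, bunches_u = label_u
    pivots_v, bunches_v = label_v
    for i in range(len(pivots_u)):
        p_u, d_u = pivots_u[i]
        if p_u in bunches_v[i]:            # p_i(u) is in B_i(v)
            return d_u + bunches_v[i][p_u]
        p_v, d_v = pivots_v[i]
        if p_v in bunches_u[i]:            # p_i(v) is in B_i(u)
            return d_v + bunches_u[i][p_v]
    raise ValueError("inconsistent labels: the top-level pivot must be in every bunch")

# Hand-built labels for the unit-weight path 0 - 1 - 2 with k = 2,
# A_0 = {0, 1, 2} and A_1 = {1} (assumed sampling outcome).
labels = {
    0: ([(0, 0), (1, 1)], [{0: 0}, {1: 1}]),
    1: ([(1, 0), (1, 0)], [{}, {1: 0}]),
    2: ([(2, 0), (1, 1)], [{2: 0}, {1: 1}]),
}
d02 = estimate(labels[0], labels[2])
d01 = estimate(labels[0], labels[1])
d12 = estimate(labels[1], labels[2])
```

The loop touches at most k levels and performs a constant number of dictionary lookups per level, matching the O(k) query time of the lemma (assuming constant-time lookups).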
3.2 Distributed Algorithm



The natural question is whether we can construct these labels in a distributed manner. For a vertex
v ∈ A_i \ A_{i+1}, let C(v) = {w ∈ V : d(w,v) < d(w,A_{i+1})}. This is called the cluster of v. Note that
the clusters are the inverse of the bunches: u ∈ C(v) if and only if v ∈ B(u). So we will construct a
distributed algorithm in which every vertex u knows exactly which clusters it is in and its distance
from the centers of those clusters, and thus is able to construct its label. Also, it is easy to see that
the clusters are connected: if u ∈ C(v) then any vertex w on the shortest path from u to
v is also in C(v).

The distributed protocol is as follows: we first divide into k phases, where in phase i we deal
with clusters from vertices in A_i \ A_{i+1}. However, we run the phases from top to bottom: we first
do phase k - 1, then phase k - 2, down to phase 0. In phase i the goal is for every node u ∈ V to
know for every node v ∈ A_i \ A_{i+1} whether u ∈ C(v), and if so, its distance to v. Thus every node u
will know B_i(u) at the end of phase i. We will give an upper bound on the length of each phase with
respect to n and S, so if every node knows both n and S they can all start each phase together
by waiting until the upper bound is met. For now we will make the assumption that every node
knows S (the shortest-path diameter), thus solving the issue of synchronizing the beginning of the
phases, but we will show in Section 3.3 how to remove this assumption.

Let us first consider phase k - 1, i.e. the first phase that is run. This phase is especially simple,
since B_{k-1}(u) = A_{k-1} for every node u. So at the end of this phase we simply want every node
to know all of the nodes in A_{k-1} and its distances to all of them. This is known as the k-Source
Shortest Paths Problem, and can be done in O(|A_{k-1}| S) rounds using O(|A_{k-1}| |E| S) messages by
running distributed Bellman-Ford from each node in A_{k-1} simultaneously [PK09]. In particular,
for a fixed source v ∈ A_{k-1} every node u ∈ V runs the following protocol: initially, u guesses that
its distance to v is d'(u,v) = ∞. If it hears a message from a neighbor w that contains a distance
a(w), then it checks if d(u,w) + a(w) < d'(u,v). If so, then it updates d'(u,v) to d(u,w) + a(w)
and sends to all its neighbors a message that contains the new d'(u,v). This algorithm is given in
detail as Algorithm 1.

Algorithm 1: Basic Bellman-Ford for u
    Initialization: d' = ∞
    For each neighbor w of u, get message a(w)
    z ← min_{w ∈ N(u)} {a(w) + d(u,w)}
    if z < d' then
        d' ← z
        Send message d' to all neighbors
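
To see the protocol in action, the following is a minimal round-by-round simulation of Algorithm 1 for a single known source. It is a centralized simulation written for illustration only, assuming the graph is a dictionary of adjacency dictionaries; the function name `simulate_bellman_ford` and the toy graph are hypothetical, and the source is treated as starting with estimate 0 and announcing it in the first round.

```python
def simulate_bellman_ford(adj, source):
    """Simulate synchronous distributed Bellman-Ford from one source.
    Returns the final estimates and the number of rounds in which messages were sent."""
    INF = float("inf")
    est = {u: INF for u in adj}      # d' at every node
    est[source] = 0
    outbox = {source: 0}             # node -> value it announces this round
    rounds = 0
    while outbox:
        rounds += 1
        # Deliver every announcement to all neighbors of its sender.
        inbox = {u: [] for u in adj}
        for sender, value in outbox.items():
            for nbr, weight in adj[sender].items():
                inbox[nbr].append(value + weight)   # a(w) + d(u, w)
        # Each node keeps the best offer and re-announces only on improvement.
        outbox = {}
        for u, offers in inbox.items():
            if offers and min(offers) < est[u]:
                est[u] = min(offers)
                outbox[u] = est[u]
    return est, rounds

# Toy graph (an assumption for illustration): path 0-1-2-3 with unit weights
# plus a heavy edge between 0 and 3.
adj = {0: {1: 1, 3: 10}, 1: {0: 1, 2: 1}, 2: {1: 1, 3: 1}, 3: {2: 1, 0: 10}}
est, rounds = simulate_bellman_ford(adj, 0)
```

Node 3 first learns the estimate 10 over the heavy edge and only later corrects it to 3, which illustrates why the running time depends on the number of hops of the true shortest path rather than on the hop-diameter.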



In this description we assume that there is one source v that is known to all nodes, but this
clearly is not necessary. With multiple unknown sources each message could also contain the ID
of the source and each node u could keep track of its guesses d'(u,·) for every source that it has
seen at least one message from. The standard analysis of Bellman-Ford (see e.g. [PK09]) gives the
following lemmas:

Lemma 3.3. At the end of phase k - 1, every node u ∈ V knows which vertices are in A_{k-1} as
well as d(u,v) for all v ∈ A_{k-1}.

Lemma 3.4. Phase k - 1 completes after at most O(|A_{k-1}| S) rounds and at most O(|E| · |A_{k-1}| S)
messages.

To handle phase i, we will assume inductively that B_{i+1}(u) is known to u at the start of phase i,
as well as the distance from u to every node in B_{i+1}(u). In particular, we will assume that u knows
its distance to the closest node in A_{i+1}, i.e. d(u,A_{i+1}). In phase i we will simply use a modified
version of Bellman-Ford in which the sources are A_i \ A_{i+1}, but node u only "participates" in the
algorithm for a source v ∈ A_i \ A_{i+1} when it gets a message that implies that d(u,v) < d(u,A_{i+1}),
i.e. that v ∈ B_i(u). To handle the multiple sources, each node u will maintain for every possible
source v ∈ V an outgoing message queue, which will only ever have 0 or 1 messages in it. Node u
just does round-robin scheduling among the nonempty queues, sending the current message to all
neighbors and removing it from the queue. To simplify the code, we will assume without loss of
generality that V = {0, 1, ..., n - 1} (this assumption is only used to simplify the round-robin
scheduler, and can easily be removed). This algorithm is given as Algorithm 2.

Algorithm 2: Modified Bellman-Ford for node u in phase i
    Initialization:
        foreach v ∈ V \ {u} do
            d'(v) ← ∞
            q(v) ← 0
        j ← 0
    In the first round:
        if u ∈ A_i \ A_{i+1} then
            Send message (u, 0) to all neighbors
    In each round:
        // Receive and process new messages
        foreach w ∈ N(u) do
            Get message m(w) = (v_w, a_w)
            if a_w + d(u,w) < d(u,A_{i+1}) ∧ a_w + d(u,w) < d'(v_w) then
                d'(v_w) ← a_w + d(u,w)
                q(v_w) ← 1
        // Send message from next nonempty queue
        j' ← j
        j ← (j + 1) % n
        while q(j) == 0 ∧ j ≠ j' do j ← (j + 1) % n
        if q(j) == 1 then
            Send message (j, d'(j)) to all neighbors
            q(j) ← 0
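
One phase of the modified protocol can be simulated centrally as follows. This is a hedged sketch rather than the authors' implementation: it assumes nodes are numbered 0, ..., n - 1, that the caller supplies the sources A_i \ A_{i+1} and the values d(u, A_{i+1}) (which the algorithm assumes are known from the previous phase), and that each source records distance 0 to itself. The names `simulate_phase`, `dist_to_next`, and the toy instance are hypothetical.

```python
def simulate_phase(adj, sources, dist_to_next):
    """Simulate one phase of the modified Bellman-Ford with round-robin queues.
    sources is A_i minus A_{i+1}; dist_to_next[u] is d(u, A_{i+1}).
    Returns (est, rounds) where est[u] maps each v in B_i(u) to d(u, v)."""
    n = len(adj)
    INF = float("inf")
    est = {u: {} for u in adj}          # d'(v) values held by node u
    pending = {u: set() for u in adj}   # sources v with q(v) = 1 at node u
    pointer = {u: 0 for u in adj}       # round-robin position of node u
    for s in sources:
        est[s][s] = 0
    in_flight = [(s, s, 0) for s in sources]   # (sender, source id, value)
    rounds = 0
    while in_flight:
        rounds += 1
        # Receive and process the messages sent in the previous step.
        for sender, v, a in in_flight:
            for u, weight in adj[sender].items():
                cand = a + weight
                if cand < dist_to_next[u] and cand < est[u].get(v, INF):
                    est[u][v] = cand
                    pending[u].add(v)
        # Every node sends one message from its next nonempty queue.
        in_flight = []
        for u in adj:
            if not pending[u]:
                continue
            j = (pointer[u] + 1) % n
            while j not in pending[u]:
                j = (j + 1) % n
            pointer[u] = j
            pending[u].discard(j)
            in_flight.append((u, j, est[u][j]))
    return est, rounds

# Toy phase (an assumption for illustration): unit-weight path 0-1-2-3-4 with
# A_i = {0, 2, 4} and A_{i+1} = {4}, so the sources are {0, 2}.
adj = {0: {1: 1}, 1: {0: 1, 2: 1}, 2: {1: 1, 3: 1}, 3: {2: 1, 4: 1}, 4: {3: 1}}
dist_to_next = {0: 4, 1: 3, 2: 2, 3: 1, 4: 0}
est, rounds = simulate_phase(adj, [0, 2], dist_to_next)
```

In this toy phase node 1 learns about both sources in the same round and has to forward them one at a time, which is the congestion that the round-robin queues are designed to handle.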



Lemma 3.5. At the end of phase i, every node u ∈ V knows B_i(u) and its distance to all nodes
in B_i(u).

Proof. We prove this by induction on the phase. The base case is phase k - 1, which is satisfied by
Lemma 3.3. Now consider some phase i with 0 ≤ i < k - 1. Let v ∈ A_i \ A_{i+1}; we will show by
induction on the hop count of the shortest path that all nodes u ∈ C(v) find out their distance to v.
If u ∈ C(v) is adjacent to v via a shortest path, then after the first round it will know its distance
to v. If u ∈ C(v) is not adjacent to v via a shortest path, then by induction the next hop on the
shortest path from u to v finds out its correct distance to v, and thus will forward the announcement
to u. Thus at the end of phase i every node u ∈ C(v) knows its distance from v, and since this holds
for every v ∈ A_i \ A_{i+1} we have that every u ∈ V knows its distance to all nodes in B_i(u). □

Before we prove the time and message complexity bounds, we give a lemma that extends the
expected size analysis of [TZ05] to give explicit tail bounds on the probability that the construction
is large:

Lemma 3.6. For every i ∈ {0, ..., k - 1} and every u ∈ V, the probability that |B_i(u)| >
O(n^{1/k} ln n) is at most 1/n^3.

Proof. In order for |B_i(u)| > 3n^{1/k} ln n, the closest 3n^{1/k} ln n nodes in A_i to u must all decide
not to be part of A_{i+1}. Since the probability that any particular node in A_i joins A_{i+1} is n^{-1/k},
the probability that this happens is at most (1 - n^{-1/k})^{3n^{1/k} ln n} ≤ e^{-3 ln n} = 1/n^3. □

We can now bound the time and message complexity of each phase: 

Lemma 3.7. With high probability, each phase takes O(n^{1/k} S log n) rounds and O(n^{1/k} S|E| log n)
messages.

Proof. Let v ∈ A_i \ A_{i+1}. Intuitively, if v were the only vertex in A_i \ A_{i+1}, then in phase i
the algorithm devolves into distributed Bellman-Ford in which the only vertices that ever forward
messages are vertices in C(v). This would clearly take O(S) rounds. In the general case, each
vertex u only participates in O(|B_i(u)|) = O(n^{1/k} log n) of these shortest path algorithms, so each
"round" of the original algorithm can be split up into O(n^{1/k} log n) rounds to accommodate all of
the different sources. Thus the total time taken is O(n^{1/k} S log n) as claimed.

To prove this formally, let u ∈ V and v ∈ A_i \ A_{i+1}, with v ∈ B_i(u). Let v = v_0, v_1, ..., v_ℓ = u
be a shortest path from v to u with the fewest number of hops. Thus ℓ ≤ S. We prove by
induction that v_j receives a message (v, d(v_{j-1},v)) at time at most O(n^{1/k} j log n), which clearly
implies the lemma. For the base case, in the first round v sends out the message (v, 0) to its
neighbors, so v_1 receives the correct message at time 1 ≤ n^{1/k} log n. For the inductive step,
consider node v_j. We know by induction that v_{j-1} received a message (v, d(v_{j-2},v)) at time at
most t = O(n^{1/k} (j - 1) log n). If v_{j-1} already knew this distance from v, then it also already sent a
message (or put one in the queue) informing its neighbors (v_j in particular) about this. Otherwise,
v_{j-1} puts a message in its outgoing queue at time t. Since the nonempty queues are processed in a
round-robin manner, and by Lemma 3.6 at most O(n^{1/k} log n) queues are ever nonempty throughout
the phase, v_{j-1} sends a message (v, d(v_{j-1},v)) at time at most t + O(n^{1/k} log n) = O(n^{1/k} j log n),
as claimed.

The time complexity bound immediately implies the message complexity bound, since in every
round there are at most 2 messages on each edge (one in each direction). □



Lemmas 3.2, 3.5, and 3.7 imply the following theorem:

Theorem 3.8. For any k ≥ 1, there is a distributed sketching algorithm that takes O(kn^{1/k} S log n)
rounds and O(kn^{1/k} S|E| log n) messages, after which with high probability every node has a sketch
of size at most O(kn^{1/k} log n) words (and expected size O(kn^{1/k}) words) that provides approximate
distances with stretch 2k - 1.
3.3 Termination Detection



We now show how to remove the assumption that every node knows S. Note that we do not show 
how to satisfy this assumption, i.e. we do not give an algorithm that computes S and distributes it 
to all nodes. Rather, we show how to detect when a phase has terminated, and thus when a new 
phase should start. We use basically the same termination detection algorithm as the one used by 
Khan et al. [KKM+08], just adapted to our context.

At the very beginning of the algorithm, even before phase $k - 1$, we run a leader election
algorithm to designate some arbitrary vertex $r$ as the leader, and then build a breadth-first search
(BFS) tree $T$ out of $r$ so that every node knows its parent in the tree as well as its children. This
can be done in $O(D) \le O(S)$ rounds and $O(|E| \log n)$ messages [KKM+08].

At the beginning of phase $i$, the leader $r$ sends a message to all nodes (along $T$) telling them
when they should start phase $i$, so they all begin together. We say that a node $u$ is complete if
either $u \notin A_i \setminus A_{i+1}$ or every vertex in $C(u)$ knows its distance to $u$ (we will see later how to use
echo messages to know when this is the case). So initially the only complete nodes are the ones
not in $A_i \setminus A_{i+1}$. Any such node that is also a leaf in $T$ immediately sends a COMPLETE message
to its parent in the tree. Throughout the phase, when any node has heard COMPLETE messages
from all of its children in $T$ and is itself complete, it sends a COMPLETE message to its parent.
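The COMPLETE convergecast described above can be sketched as a small centralized simulation. This is only an illustration under stated assumptions, not the distributed protocol itself: the function name `root_completion_round`, the `children` map, and the `complete_at` map (the round in which each node becomes locally complete) are hypothetical, and each tree edge is assumed to take exactly one round.

```python
def root_completion_round(children, complete_at, root):
    """Round in which `root` is complete and has heard COMPLETE from all children.

    children:    dict mapping a node to the list of its children in the tree T
    complete_at: dict mapping a node to the round in which it becomes complete
    """
    def ready(u):
        # A child sends COMPLETE in the round it becomes ready; the message
        # reaches u one round later (assumed unit delay per tree edge).
        latest_child = max((ready(c) + 1 for c in children.get(u, [])), default=0)
        # u is ready once it is complete AND every child has reported.
        return max(complete_at[u], latest_child)

    return ready(root)
```

For instance, on a tree with root `r`, children `a` and `b`, and a grandchild `c` below `a`, the slowest branch determines when the root can start the next phase.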

Now suppose that when running phase $i$, some node $u$ gets a message $m(w) = (v_w, a_w)$ from
a neighbor $w$. There are two reasons that this message might not result in a new message added
to the send queue: if $a_w + d(u,w) \ge d(u, A_{i+1})$ ($v_w$ has not yet been shown to be in $B_i(u)$), or if
$a_w + d(u,w) \ge d'(v_w)$ ($u$ already knows a shorter path to $v_w$). Furthermore, even if $m(w)$ does
result in a new message added to the send queue, it might get superseded by a new message added
to the queue with an updated value of $d'(v_w)$ before the value based on $m(w)$ can be sent. All
three of these conditions can be tracked by $u$, so for each message $m$ that $u$ receives (say from
neighbor $w$) it keeps track of whether or not it sends out a new message based on $m$. If it does not
(one of the two conditions failed, or it was superseded), then it sends an ECHO message back to
$w$, together with a copy of the message. If $u$ does send out a new message based on $m$, then when
it has received ECHO messages for $m$ from all of its neighbors (except for $w$) it sends an ECHO
message to $w$ together with a copy of $m$.
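The local decision a node makes on receiving a message can be sketched as follows. This is a hedged illustration: the function name `classify_message` and its parameters are hypothetical, ties are treated as non-improving (an assumption), and the third case in the text (a queued message being superseded before it is sent) is left to the caller.

```python
def classify_message(a_w, d_uw, d_next_level, best_known):
    """Decide what u does with a message (v_w, a_w) received from neighbor w.

    a_w:          distance from w to the source v_w carried in the message
    d_uw:         length of the edge (u, w)
    d_next_level: d(u, A_{i+1}), the distance from u to the next sampled level
    best_known:   u's current estimate d'(v_w), or None if u has none yet
    """
    candidate = a_w + d_uw
    if candidate >= d_next_level:
        # v_w is not shown to be in the bunch of u: nothing to forward.
        return "ECHO_NOW"
    if best_known is not None and candidate >= best_known:
        # u already knows a path to v_w that is at least as short.
        return "ECHO_NOW"
    # Otherwise u enqueues an updated message and echoes back to w only after
    # all of its other neighbors have echoed that updated message.
    return "FORWARD"
```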

It is easy to see inductively that when a node $u$ sends a message $m$, it will also know via ECHO
messages when $m$ has ceased to propagate in the network since all of its neighbors will have ECHO'd
it back to $u$. So if $u \in A_i \setminus A_{i+1}$, and thus only sends out one message that has first coordinate
$u$, it will know when this message has stopped propagating, which clearly implies that every node
$v \in V$ that is in $C(u)$ knows its correct distance to $u$ (as well as the fact that $u \in B_i(v)$). At this
point $u$ is complete, so once it has received COMPLETE messages from all of its children in $T$ it
will send a COMPLETE message to its parent.

Once $r$ has received COMPLETE messages from all of its children (and is itself complete) the
phase is over. So $r$ starts the next phase by sending a START message to all nodes along the tree $T$,
and the next phase begins.

It is easy to see that the ECHOs only double the number of messages and rounds, since any 
message sent along an edge corresponds to exactly one ECHO sent back the other way. Electing 
a leader and building a BFS tree take only a negligible number of messages and rounds compared 
to the bounds of Theorem 3.8. Each node sends only one COMPLETE message, so there are at
most $O(n)$ COMPLETE messages, which is tiny compared to the bound in Theorem 3.8, and the
number of extra rounds due to COMPLETE messages is clearly only $O(D)$. Thus even with the
extra termination detection, the bounds of Theorem 3.8 still hold.

4 Sketches with Slack 

Let $u, v \in V$. We say that $v$ is $\epsilon$-far from $u$ if $|\{w : d(u,w) < d(u,v)\}| \ge \epsilon n$, i.e. if $v$ is not one of
the $\epsilon n$ closest nodes to $u$. Given a labeling $L(u)$ for each $u \in V$, we say that it has stretch $2k - 1$
and $\epsilon$-slack if the distance that we compute for $u$ and $v$ given $L(u)$ and $L(v)$ is at least $d(u,v)$ and
at most $(2k - 1) d(u,v)$ for all $u, v \in V$ where $v$ is $\epsilon$-far from $u$. Labelings with slack were previously
studied by Chan, Dinitz, and Gupta [CDG06] and Abraham, Bartal, Chan, Dhamdhere, Gupta,
Kleinberg, Neiman, and Slivkins [ABC+05]. The main technique of Chan et al. was the use of a
new type of net they called a density net. For each $u \in V$, let $R(u,\epsilon) = \inf\{r : |B(u,r)| \ge \epsilon n\}$
be the minimum distance necessary for the ball around $u$ to contain at least $\epsilon n$ points, and let
$B_\epsilon(u) = B(u, R(u,\epsilon))$ be this ball. We give a definition of density net that is slightly modified
from [CDG06] in order to make it easier to work with in a distributed context.
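The two definitions above can be made concrete with a short sketch. The function names `is_eps_far` and `density_radius`, and the representation of distances as a dictionary keyed by node (including $u$ itself at distance 0), are assumptions introduced for illustration only.

```python
import math


def is_eps_far(dist_from_u, v, eps):
    """True if at least eps*n nodes are strictly closer to u than v is."""
    n = len(dist_from_u)
    closer = sum(1 for d in dist_from_u.values() if d < dist_from_u[v])
    return closer >= eps * n


def density_radius(dist_from_u, eps):
    """R(u, eps): the smallest r such that the closed ball B(u, r) holds >= eps*n nodes."""
    n = len(dist_from_u)
    need = max(1, math.ceil(eps * n))
    # The need-th smallest distance is the first radius whose ball is large enough.
    return sorted(dist_from_u.values())[need - 1]
```

With four nodes at distances 0, 1, 2, 3 from $u$ and $\epsilon = 1/2$, the radius is 1, and the node at distance 2 is $\epsilon$-far while the node at distance 1 is not.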

Definition 4.1. A set of vertices $N \subseteq V$ is an $\epsilon$-density net if:

1. For all $u \in V$, there is a vertex $v \in N$ such that $d(u,v) \le R(u,\epsilon)$, and

2. $|N| \le O(\frac{1}{\epsilon} \ln n)$.

Chan et al. give a centralized algorithm that computes an $\epsilon$-density net in polynomial time
for any $\epsilon$. Their density nets are somewhat different, in that they contain only $1/\epsilon$ nodes but the
closest net node to $u$ is only guaranteed to be within $2R(u,\epsilon)$ instead of $R(u,\epsilon)$. We modify these
values in order to give a distributed construction, and in fact with these modifications it is trivial
to build density nets via random sampling.

Lemma 4.2. There is a distributed algorithm that, with high probability, constructs an $\epsilon$-density
net in constant time.

Proof. The algorithm is simple: every vertex independently chooses to be in $N$ with probability
$\min\{1, \frac{5 \ln n}{\epsilon n}\}$. The expected size of $N$ is clearly $\frac{5}{\epsilon} \ln n$, and by a simple Chernoff bound (see e.g. [MR95])
the probability that $|N|$ exceeds the bound in the second constraint is at most
$e^{-(20 \ln n)/(3\epsilon)} \le 1/n^{6/\epsilon}$, so the second constraint is satisfied with high probability.

For the first constraint, that for every vertex $u$ there is some vertex $v \in B_\epsilon(u) \cap N$, we split
into two cases depending on $\epsilon$. If $\epsilon \le \frac{5 \ln n}{n}$, then every node has probability 1 of being in $N$, so
the condition is trivially satisfied. Otherwise we have $\epsilon > \frac{5 \ln n}{n}$, so for every $u$ we have
$|B_\epsilon(u)| \ge \epsilon n > 5 \ln n$ and the expected size of $B_\epsilon(u) \cap N$ is at least $5 \ln n$. Using a similar Chernoff bound (but
from the other direction) gives us that the probability that $|B_\epsilon(u) \cap N|$ is less than 1 is at most
$e^{-(25 \ln n)/8} \le 1/n^3$. Now we can just take a union bound over all $u$ to get that the first constraint
is satisfied with high probability. □
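The sampling step in this proof can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function name `sample_density_net` is hypothetical, and the inclusion probability $\min\{1, 5 \ln n / (\epsilon n)\}$ is an assumption read off from the case analysis in the proof (every node is included with probability 1 once $\epsilon \le 5 \ln n / n$). Each node decides independently, which is why no communication is needed.

```python
import math
import random


def sample_density_net(nodes, eps, rng):
    """Each node joins the net independently with probability min(1, 5 ln n / (eps n))."""
    n = len(nodes)
    p = min(1.0, 5.0 * math.log(n) / (eps * n))
    # rng.random() lies in [0, 1), so p == 1.0 always selects the node.
    return {u for u in nodes if rng.random() < p}
```

A caller would pass its own random source, for example `sample_density_net(range(1000), 0.5, random.Random(7))`.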

Using this construction, we can efficiently construct short sketches with $\epsilon$-slack:

Theorem 4.3. There is a distributed algorithm that uses at most $O(S \frac{1}{\epsilon} \log n)$ rounds and at most
$O(S |E| \frac{1}{\epsilon} \log n)$ messages so that at the end of the algorithm, every node has a sketch of size at most
$O(\frac{1}{\epsilon} \log n)$ words with stretch at most 3 and $\epsilon$-slack.
Proof. The algorithm first uses Lemma 4.2 to construct an $\epsilon$-density net $N$. It is easy to see that
the closest net points to $u$ and $v$ are a good approximation to the distance between $u$ and $v$. In
particular, suppose that $v$ is $\epsilon$-far from $u$, let $u'$ be the closest node in $N$ to $u$, and let $v'$ be the
closest node in $N$ to $v$. Then $d(u,u') \le R(u,\epsilon) \le d(u,v)$ by the definition of $\epsilon$-far and $\epsilon$-density
nets, $d(v,u') \le d(v,u) + d(u,u') \le 2d(u,v)$, and thus $d(u,u') + d(v,u') \le 3d(u,v)$. This means
that if every vertex keeps as its sketch its distance from all nodes in $N$, we will have sketches
with $\epsilon$-uniform slack and stretch 3 (to compute an approximation to $d(u,v)$ we can just
use $\min_{w \in N} \{d(u,w) + d(w,v)\}$, which we can compute from the two sketches). The size of these
sketches is clearly $O(|N|) = O(\frac{1}{\epsilon} \log n)$, since for every node in $N$ we just need to store its ID and
its distance.

It just remains to show how to compute these sketches efficiently. But this is simple, since it is
exactly the $k$-Source Shortest Paths problem where the sources are the nodes in $N$. So we simply
run the $k$-source version of Distributed Bellman-Ford, which gives the claimed time and message
complexity bounds. □
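The stretch-3 sketch from this proof can be illustrated with a centralized stand-in. The sketch below uses Dijkstra from each net node in place of the distributed $k$-source Bellman-Ford (a substitution made purely for brevity), assumes a connected graph, and all names (`dijkstra`, `build_slack_sketches`, `estimate`, the adjacency-list format) are hypothetical.

```python
import heapq


def dijkstra(adj, src):
    """Single-source shortest paths; adj maps node -> list of (neighbor, weight)."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def build_slack_sketches(adj, net):
    """Sketch of u = {net node w: d(u, w)} for every w in the density net."""
    sketches = {u: {} for u in adj}
    for w in net:
        for u, d in dijkstra(adj, w).items():
            sketches[u][w] = d
    return sketches


def estimate(sketch_u, sketch_v):
    """min over net nodes w of d(u, w) + d(w, v); never below the true distance."""
    return min(sketch_u[w] + sketch_v[w] for w in sketch_u if w in sketch_v)
```

By the triangle inequality the estimate is always at least $d(u,v)$; the factor-3 upper bound is only guaranteed when $v$ is $\epsilon$-far from $u$ and the net satisfies Definition 4.1.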

We can get a different tradeoff by applying Thorup and Zwick to the density net itself, instead
of simply having every node remember its distance to all net nodes. This is essentially
what is done in the slack labeling schemes of Chan et al. [CDG06], just with slightly different
parameters and constructions (since they were able to use centralized constructions). In
particular, suppose that we manage to use Thorup-Zwick on the net, so the distances between
net points are preserved up to stretch $2k - 1$. Then these sketches would have size at most
$O(k |N|^{1/k} \log n) = O(k (\frac{1}{\epsilon} \log n)^{1/k} \log n)$. For each $u \in V$, let $u' \in N$ be the closest node in
the density net to $u$. We let the sketch of $u$ be the identity of $u'$, the distance between $u$ and $u'$,
and the Thorup-Zwick label of $u'$. We call this the $(\epsilon, k)$-CDG sketch. Clearly this sketch has size
$O(k (\frac{1}{\epsilon} \log n)^{1/k} \log n)$. Let $u, v \in V$ such that $v$ is $\epsilon$-far from $u$. Our estimate of the distance will
be $d(u,u') + d''(u',v') + d(v',v)$, where $d''(u',v') \le (2k - 1) d(u',v')$ is the approximate distance
given by the Thorup-Zwick labels. This can obviously be computed given the sketches for $u$ and $v$.
To bound the stretch, we simply use the definition of density nets, the triangle inequality, and the
fact that $v$ is $\epsilon$-far from $u$. This gives us a distance estimate $d'(u,v)$ with

\begin{align*}
d'(u,v) &= d(u,u') + d''(u',v') + d(v',v) \\
&\le d(u,u') + (2k - 1)\, d(u',v') + d(v',v) \\
&\le d(u,v) + (2k - 1)\bigl(d(u',u) + d(u,v) + d(v,v')\bigr) + 2d(u,v) \\
&\le 3d(u,v) + (2k - 1)(4d(u,v)) \\
&= (8k - 1)\, d(u,v).
\end{align*}
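The estimate and the arithmetic behind the $8k - 1$ bound can be sketched as follows. The names `cdg_estimate` and `worst_case_stretch`, the tuple layout of a sketch, and the `tz_estimate` callback (standing in for the Thorup-Zwick label query, which is not reproduced here) are all hypothetical.

```python
def cdg_estimate(sketch_u, sketch_v, tz_estimate):
    """d(u, u') + d''(u', v') + d(v', v) from two (net_node, dist, tz_label) sketches."""
    _, dist_u, label_u = sketch_u
    _, dist_v, label_v = sketch_v
    # tz_estimate is assumed to return a (2k - 1)-approximate distance
    # between the two net nodes from their labels alone.
    return dist_u + tz_estimate(label_u, label_v) + dist_v


def worst_case_stretch(k):
    """3 from the two hops to the net, plus (2k - 1) times a detour of at most 4 d(u, v)."""
    return 3 + 4 * (2 * k - 1)
```

For every integer $k$, `worst_case_stretch(k)` equals $8k - 1$, matching the last line of the derivation.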

This gives the basic lemma about these sketches, which was proved by [CDG06] (modulo our
modifications to density nets):



Lemma 4.4 ([CDG06]). For any $\epsilon > 0$ and $1 \le k \le O(\log \frac{1}{\epsilon})$, with high probability the $(\epsilon, k)$-CDG
sketch has size at most $O(k (\frac{1}{\epsilon} \log n)^{1/k} \log n)$ words and $(8k - 1)$-stretch with $\epsilon$-slack.

It remains to show how to construct $(\epsilon, k)$-CDG sketches in a distributed manner, which boils
down to modifying the algorithm of Theorem 3.8 to work with the density net rather than with the
full point set (note that this is trivial in a centralized setting since we can just consider the metric
completion).

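Lemma 4.5. For any $\epsilon > 0$ and $1 \le k \le O(\log \frac{1}{\epsilon})$, there is a distributed algorithm so that after
$O(k S (\frac{1}{\epsilon} \log n)^{1/k} \log n)$ rounds and $O(k S |E| (\frac{1}{\epsilon} \log n)^{1/k} \log n)$ messages every node knows its
$(\epsilon, k)$-CDG sketch.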



Proof. We first apply Lemma 4.2 to construct the $\epsilon$-density net $N$. We now want every node $u$
to know its closest net node $u'$ and its distance from $u'$. This can be done via a single use of
Distributed Bellman-Ford, where we just imagine a "super node" consisting of all of $N$. This takes
$O(S)$ rounds and $O(S|E|)$ messages.
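The "super node" idea can be illustrated centrally by starting a shortest-path search from all net nodes at once. This is a sketch under assumptions: it uses a multi-source Dijkstra rather than Distributed Bellman-Ford, assumes a connected graph, and the name `closest_net_node` and the adjacency-list format are hypothetical.

```python
import heapq


def closest_net_node(adj, net):
    """Map each node u to (closest net node u', d(u, u')).

    adj maps node -> list of (neighbor, weight). All net nodes start at
    distance 0, which is equivalent to searching from one merged super node.
    """
    dist = {w: 0 for w in net}
    owner = {w: w for w in net}
    heap = [(0, w) for w in net]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, wt in adj[u]:
            nd = d + wt
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                owner[v] = owner[u]  # v inherits the net node that reached it first
                heapq.heappush(heap, (nd, v))
    return {u: (owner[u], dist[u]) for u in dist}
```

Ties between equally close net nodes are broken arbitrarily, which does not affect the distance stored in the sketch.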

Now we need to run Thorup-Zwick on $N$. But this is easy to do, since we just modify the $A_i$ sets
to be subsets of $N$ instead of $V$ and change the sampling probability from $n^{-1/k}$ to $(\frac{1}{\epsilon} \ln n)^{-1/k}$.
Note that for every node $u \in V$ the bunch $B_i(u)$ is still well defined, and with high probability has
size at most $O((\frac{1}{\epsilon} \log n)^{1/k} \log n)$ (via an argument analogous to Lemma 3.6). This means that we
can run Algorithm 2 using these new $A_i$ sets and every node will know its Thorup-Zwick sketch
for these $A_i$ sets. In particular, the nodes in $N$ will have a sketch that is exactly equal to the sketch
they would have if we ran Algorithm 2 on the metric completion of $N$, rather than on $G$. It is
easy to see that Lemma 3.7 still applies but with $n^{1/k} \log n$ (the upper bound on the size of each
$B_i(u)$) changed to $O((\frac{1}{\epsilon} \log n)^{1/k} \log n)$, so each phase takes $O(S (\frac{1}{\epsilon} \log n)^{1/k} \log n)$ rounds and
$O(S |E| (\frac{1}{\epsilon} \log n)^{1/k} \log n)$ messages. Since there are $k$ phases, this gives the desired complexity
bounds.

As before, this assumes that every node knows $S$ in order to synchronize the phases. However,
we can remove this assumption by using the termination detection algorithm of Section 3.3. This
at most doubles the number of messages and rounds and adds an extra $O(|E| \log n)$ messages and
$O(D)$ rounds, which is negligible. □



Combining Lemmas 4.4 and 4.5 gives us the following theorem.

Theorem 4.6. For any $\epsilon > 0$ and $1 \le k \le O(\log \frac{1}{\epsilon})$, there is a distributed sketching algorithm that
completes in at most $O(k S (\frac{1}{\epsilon} \log n)^{1/k} \log n)$ rounds and $O(k S |E| (\frac{1}{\epsilon} \log n)^{1/k} \log n)$ messages,
after which with high probability every node has a sketch of size at most $O(k (\frac{1}{\epsilon} \log n)^{1/k} \log n)$ words
that provides approximate distances with stretch $8k - 1$ and $\epsilon$-slack.



4.1 Gracefully Degrading Sketches and Average Stretch 

We now show how to use Theorem 4.6 to construct sketches with bounded average stretch, as well
as bounded worst-case stretch. Formally, suppose that we have a weighted graph $G = (V, E)$ that
induces the metric $d$ and a sketching algorithm that allows us to compute distance estimates $d'$
with the property that $d'(u,v) \ge d(u,v)$ for all $u, v \in V$. The average stretch of the sketching
algorithm is $\frac{1}{\binom{n}{2}} \sum_{\{u,v\} \in \binom{V}{2}} \frac{d'(u,v)}{d(u,v)}$.
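Average stretch is straightforward to compute once true and estimated distances are available. The sketch below assumes the average is taken over unordered pairs of the ratio $d'(u,v)/d(u,v)$, which is how the quantity is used in the proof of Lemma 4.7; the function name `average_stretch` and the callback-style arguments are hypothetical.

```python
from itertools import combinations


def average_stretch(nodes, d, d_est):
    """Mean of d_est(u, v) / d(u, v) over all unordered pairs of distinct nodes."""
    pairs = list(combinations(nodes, 2))
    return sum(d_est(u, v) / d(u, v) for u, v in pairs) / len(pairs)
```

For example, if every estimate is exactly three times the true distance, the average stretch is 3 regardless of the metric.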

In fact, we will prove a stronger statement, that there are good distributed algorithms for
computing gracefully degrading sketches. A sketching algorithm is gracefully degrading with $f(\epsilon)$
stretch if for every $\epsilon \in (0, 1)$ it is a sketch with stretch $f(\epsilon)$ and $\epsilon$-slack. In other words, instead
of specifying $\epsilon$ ahead of time (as in the slack constructions) we need a single sketch that works
simultaneously for every $\epsilon$. It is easy to see that when $f(\epsilon)$ is $O(\log \frac{1}{\epsilon})$, gracefully degrading sketches
provide the desired average and worst-case stretch bounds (this was implicit in Chan et al. [CDG06],
but they only formally showed this for their specific gracefully-degrading construction, which is
slightly different than ours):

Lemma 4.7. Any gracefully degrading sketching algorithm with $O(\log \frac{1}{\epsilon})$ stretch has stretch at most
$O(\log n)$ and average stretch at most $O(1)$.

Proof. The bound on the worst-case stretch is immediate by setting $\epsilon = \frac{1}{n}$. With this setting of
$\epsilon$, every two points are $\epsilon$-far from each other, and thus the stretch bound of $O(\log \frac{1}{\epsilon}) = O(\log n)$
holds for all pairs.

To bound the average stretch, for each $1 \le i \le \log n$ and vertex $u \in V$ let
$A(u,i) = B_{1/2^{i-1}}(u) \cap (V \setminus B_{1/2^i}(u))$. In other words, $A(u,i)$ is the set of points that are outside the smallest ball around
$u$ containing at least $n/2^i$ points, but inside the smallest ball around $u$ containing at least $n/2^{i-1}$
points. Note that $|A(u,i)| = n/2^i$. Furthermore, we can bound the stretch between $u$ and any node
in $A(u,i)$ by $O(i)$, since when we set $\epsilon = 1/2^i$ we have a stretch bound of $O(\log \frac{1}{\epsilon}) = O(i)$ for the
nodes in $A(u,i)$. Then the average stretch is at most

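\begin{align*}
\frac{1}{\binom{n}{2}} \sum_{\{u,v\} \in \binom{V}{2}} \frac{d'(u,v)}{d(u,v)}
&= \frac{1}{n(n-1)} \sum_{u \in V} \sum_{v \ne u} \frac{d'(u,v)}{d(u,v)} \\
&\le \frac{1}{n(n-1)} \sum_{u \in V} \sum_{i=1}^{\log n} \sum_{v \in A(u,i)} \frac{d'(u,v)}{d(u,v)} \\
&\le \frac{1}{n(n-1)} \sum_{u \in V} \sum_{i=1}^{\log n} \frac{n}{2^i}\, O(i) \\
&\le \frac{1}{n(n-1)} \sum_{u \in V} O(n) \\
&\le O(1),
\end{align*}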

proving the lemma. □ 
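The key step in this bound is that $\sum_i i/2^i$ stays below a constant no matter how many levels there are. The check below verifies this exactly with rational arithmetic, using the standard closed form $\sum_{i=1}^{m} i/2^i = 2 - (m+2)/2^m$; the function name `level_sum` is hypothetical and the hidden constant in $O(i)$ is taken to be 1 for illustration.

```python
from fractions import Fraction


def level_sum(m):
    """Exact value of sum_{i=1}^{m} i / 2^i, the per-node stretch budget over m levels."""
    return sum(Fraction(i, 2 ** i) for i in range(1, m + 1))
```

Since the sum is always below 2, each node contributes $O(n)$ to the double sum, and dividing by $n(n-1)$ over all $n$ nodes leaves $O(1)$.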

This lemma reduces the problem of constructing sketches with good average stretch to the
problem of constructing gracefully degrading sketches. But this turns out to be simple, given
Theorem 4.6. The intuition behind gracefully degrading sketches is that they work simultaneously
for every slack parameter $\epsilon$, so to create them we simply use $O(\log n)$ different sketches with slack,
one for each power of 2 between $1/n$ and 1.

Theorem 4.8. There is a distributed gracefully degrading sketching algorithm that gives sketches
of size at most $O(\log^4 n)$ words with $O(\log \frac{1}{\epsilon})$-stretch that completes in at most $O(S \log^4 n)$ rounds
and at most $O(S |E| \log^4 n)$ messages.



Proof. Our construction is simple: for every $1 \le i \le \log n$ we use Theorem 4.6 with slack $\epsilon_i = 1/2^i$
and stretch parameter $k = O(\log \frac{1}{\epsilon_i}) = O(\log 2^i)$. The sketch remembered by a node is just the union of these
$O(\log n)$ sketches. Given the sketches for two different vertices $u$ and $v$ where $v$ is $\epsilon$-far from $u$,
we can compute the $O(\log n)$ different distance estimates and take the minimum of them as our
estimate.

To see that this is gracefully degrading with stretch $O(\log \frac{1}{\epsilon})$, first note that all of the $O(\log n)$
estimates are at least as large as $d(u,v)$, so we just need to show that at least one of the estimates is
at most $O(\log \frac{1}{\epsilon}) d(u,v)$. Let $\epsilon_i$ be $\epsilon$ rounded down to the nearest power of $1/2$. Then $v$ is obviously
$\epsilon_i$-far from $u$, so the estimate for the $\epsilon_i$-sketch will provide an estimate of at most
$O(\log \frac{1}{\epsilon_i}) d(u,v) = O(\log \frac{1}{\epsilon}) d(u,v)$.



Theorem 4.6, when specialized to the case of $k = O(\log \frac{1}{\epsilon})$, completes in at most $O(S \log \frac{1}{\epsilon} \log^2 n)$
rounds and $O(S |E| \log \frac{1}{\epsilon} \log^2 n)$ messages and gives sketches of size $O(\log \frac{1}{\epsilon} \log^2 n)$. Since we just
run each of the $O(\log n)$ instantiations of the theorem back to back, the total number of rounds is
at most $O(S \log^2 n) \sum_{i=1}^{\log n} \log 2^i = O(S \log^4 n)$, the number of messages is at most $O(S |E| \log^4 n)$,
and the size is at most $O(\log^4 n)$. Note that we can handle termination detection for each of
these as usual, based on Section 3.3. □
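The schedule of slack parameters and the query rule from this proof can be sketched as follows. This is a minimal illustration: the function names are hypothetical, logarithms are base 2, and the per-level estimates are assumed to come from the slack sketches of Theorem 4.6, which are not reproduced here.

```python
import math


def slack_levels(n):
    """Slack parameters 1/2, 1/4, ..., one per level i = 1 .. ceil(log2 n)."""
    return [2.0 ** -i for i in range(1, math.ceil(math.log2(n)) + 1)]


def round_down_to_power_of_half(eps):
    """Largest power of 1/2 that is at most eps (the level used in the analysis)."""
    i = math.ceil(math.log2(1.0 / eps))
    return 2.0 ** -i


def graceful_estimate(per_level_estimates):
    """Every level's estimate is at least d(u, v), so the minimum is the best one."""
    return min(per_level_estimates)
```

Note that the query never needs to know $\epsilon$: it takes the minimum over all levels, and the rounding function only appears in the analysis to identify which level certifies the stretch bound.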



Together with Lemma 4.7, this gives the following corollary: 



Corollary 4.9. There is a distributed sketching algorithm that gives sketches of size at most
$O(\log^4 n)$ with $O(\log n)$-stretch and $O(1)$ average stretch that completes in at most $O(S \log^4 n)$
rounds and at most $O(S |E| \log^4 n)$ messages.



Note that, when compared to our sketch from Theorem 3.8 with $O(\log n)$ stretch, we pay only
an extra $O(\log^2 n)$ factor in the size of the sketch as well as the number of rounds and messages,
and in return we are able to achieve constant average stretch.

5 Conclusions 

In this paper we initiated the study, from a theoretical point of view, of distributed algorithms for
computing distance sketches in a network. We showed that the Thorup-Zwick distance sketches [TZ05],
which provide an almost-optimal tradeoff between the size of the sketches and their accuracy, can
be computed efficiently in a distributed setting, where our notion of efficiency is the standard
definition of the number of rounds in the CONGEST model. Combining this distributed algorithm
with centralized techniques of Chan et al. [CDG06], which we were also able to turn into efficient
distributed algorithms, yielded a combined construction with the same worst-case stretch as the
smallest version of Thorup-Zwick, but much better average stretch. This required only a
polylogarithmic cost in the size of the sketches and the time necessary to construct them. These results are
a first step towards making the theoretical work on distance sketches more practical, by moving
from a centralized setting to a distributed setting. It would be interesting in the future to weaken
the distributed model even further, by working in failure-prone and asynchronous settings, in the
hope of eventually getting practical distance sketches with provable performance guarantees.